2. Next 7-day forecasting of COVID-19 cases EDA

2.1 Contents

2.2 Imports

2.3 Functions

2.3.1 Function: string to float

2.3.2 Function: visualize cases in different countries

2.3.3 Function: visualize mobility data and cases

2.3.4 Function: get feature correlations

2.4 Load data

2.5 Explore the data

Note: our dataset comprises data from 23 countries (ISO two-letter country codes: 'AR', 'AU', 'AT', 'BE', 'CA', 'DK', 'FI', 'FR', 'DE', 'IN', 'ID', 'IE', 'IL', 'IT', 'JP', 'KR', 'MX', 'NL', 'NO', 'RU', 'SG', 'GB', 'US'). Each country has its own dataframe containing the features from which that country's future COVID-19 cases (cases per million for the next 7 days) will be forecast. To understand how the data are organized, let's take a look at the US data as an example.

First of all, no column contains NaN or missing values; these were dealt with during the previous data wrangling stage. Are all data in proper types? The first group of feature variables, the Google mobility data ('rtrc': retail and recreation, 'grph': grocery and pharmacy, 'tran': transportation, 'work': workplace, 'resi': residential area), are all properly in float64 format. Vaccination per million ('vac') is also in float64. However, we note that the weather-related features ('cloudcover': cloud coverage, 'tempC': temperature in Celsius, 'humidity': humidity, 'precipMM': precipitation in millimeters) are in string format and must be transformed to numeric values before they can be used in our prediction models. The remaining columns, 'holiday' and 'dayow' (day of the week), are properly in int64.
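The string-to-numeric conversion described above could be sketched as follows. This is a minimal illustration, not the notebook's own function: the helper name `weather_to_float`, the toy values, and the choice of `errors="coerce"` (which turns unparseable strings into NaN) are assumptions; only the four weather column names come from the text.

```python
import pandas as pd

# Weather columns named in the text; they arrive as strings.
WEATHER_COLS = ["cloudcover", "tempC", "humidity", "precipMM"]


def weather_to_float(df, cols=WEATHER_COLS):
    """Return a copy of df with the given columns converted to float64.

    Hypothetical helper: unparseable strings become NaN (errors="coerce"),
    which is an assumption rather than the notebook's documented behavior.
    """
    out = df.copy()
    for col in cols:
        out[col] = pd.to_numeric(out[col], errors="coerce").astype("float64")
    return out


# Toy stand-in for a country dataframe (values are made up).
toy = pd.DataFrame(
    {
        "cloudcover": ["10", "55"],
        "tempC": ["21", "18"],
        "humidity": ["40", "62"],
        "precipMM": ["0.0", "1.2"],
    }
)
converted = weather_to_float(toy)
print(converted.dtypes)
```

Returning a copy leaves the original dataframe untouched, so the conversion can be re-run safely while exploring.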

To identify appropriate models for our forecasting, we need to understand the characteristic temporal patterns of the different time series. These patterns can offer insight into which features are most relevant to the target variable (the number of COVID-19 cases) and which features are correlated with one another. Before we dive into the correlation matrix, some visualizations are useful.

2.5.1 Explore the data: plot mobility, vaccination, cases

Plotting the mobility time series and the number of cases on the same axes shows, first, that mobility in the different categories follows characteristic temporal patterns, suggesting that the data are rich and potentially informative for forecasting the number of COVID-19 cases. For instance, mobility in parks (yellow) appears negatively correlated with the number of cases (white) in the US. Our forecasting can leverage such temporal relationships to improve the model's predictive power. Secondly, we observe a robust periodicity in all of the data, and the period is clearly weekly. This is expected, as the amount of human traffic at, for instance, workplaces varies dramatically between weekdays and weekends. The impact of this weekly oscillation on model performance is not clear at this point. However, because the same oscillation also exists in the target variable (COVID-19 cases), it could improve the model's predictive power if it is modeled properly. In addition, the widespread periodicity in both features and target implies that our model needs enough complexity to capture the periodicity and the dynamic, time-evolving patterns, so simple approaches like linear regression are probably not the best choice.
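One way to check the weekly periodicity numerically, rather than by eye, is to compare the autocorrelation at a 7-day lag with the autocorrelation at another lag. The sketch below uses a synthetic series with a weekend dip as a stand-in for a mobility column; the series and variable names are illustrative assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: +1 on weekdays, -1 on weekends (made-up values).
days = pd.date_range("2021-01-01", periods=140, freq="D")
weekly_pattern = np.where(days.dayofweek >= 5, -1.0, 1.0)
series = pd.Series(weekly_pattern, index=days)

# Pearson autocorrelation of the series with itself shifted by `lag` days.
lag7 = series.autocorr(lag=7)  # one full week: pattern lines up with itself
lag3 = series.autocorr(lag=3)  # mid-week offset: pattern is misaligned

print(f"lag 7: {lag7:.2f}, lag 3: {lag3:.2f}")
```

A clearly higher value at lag 7 than at neighboring lags is what a weekly cycle looks like in this measure; on real data a slow trend may need to be removed first for the comparison to be meaningful.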

2.5.2 Explore the data: plot timecourses of COVID-19 cases in 5 countries

Importantly, we also need to decide how to treat the 23 different countries. Do data from different countries exhibit similar temporal patterns? We already know that they do not. Countries have followed very different epidemiological trajectories because of the many variables that differ across them. To illustrate these differences, the number of cases over time is plotted below for 5 countries (for simplicity).

As can be seen above, countries show quite different time-evolving patterns in their COVID-19 cases. This suggests that a single model that pools data across countries without distinguishing them would be suboptimal. Models need to treat different countries differently; in other words, country ID should be a critical feature in our model for predicting future COVID-19 cases.
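If country ID is to enter the model as a feature, one common option is to stack the per-country dataframes and one-hot encode the country code. The sketch below is only an illustration of that idea: the toy values, the `country` column, and the target name `cases_pm` are hypothetical, and the notebook may encode countries differently.

```python
import pandas as pd

# Toy per-country dataframes (made-up values; 'cases_pm' is a hypothetical
# name for the cases-per-million target).
frames = {
    "US": pd.DataFrame({"work": [-20.0, -18.0], "cases_pm": [150.0, 160.0]}),
    "JP": pd.DataFrame({"work": [-5.0, -7.0], "cases_pm": [10.0, 12.0]}),
}

# Tag each row with its country code, then stack all countries.
pooled = pd.concat(
    [df.assign(country=code) for code, df in frames.items()],
    ignore_index=True,
)

# One indicator column per country, e.g. 'country_US', 'country_JP'.
encoded = pd.get_dummies(pooled, columns=["country"], prefix="country")
print(encoded.columns.tolist())
```

Tree-based models can often use an integer or categorical country code directly; one-hot encoding is shown here because it works with most model families.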

2.5.3 Explore the data: visualize feature correlations

Ok, we've learned that countries differ in COVID-19 cases over time, and so the model should treat each country differently. Now we seek more insight into the features and their relevance to the target variable (COVID-19 cases). As a first-pass analysis, we turn to the correlation matrix to reveal any linear relationships among features and between each feature and the target. To get a general idea first, we pool the data across all countries and compute Pearson's correlations.
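The pooled Pearson correlation described above can be sketched with `pandas.concat` and `DataFrame.corr`. The two toy dataframes and the target column name `cases_pm` are assumptions made for illustration; only 'work' and 'resi' are column names taken from the text.

```python
import pandas as pd

# Toy stand-ins for two country dataframes (made-up values).
us = pd.DataFrame(
    {
        "work": [-30.0, -25.0, -20.0, -15.0],
        "resi": [12.0, 10.0, 9.0, 6.0],
        "cases_pm": [200.0, 180.0, 150.0, 120.0],
    }
)
jp = pd.DataFrame(
    {
        "work": [-10.0, -8.0, -12.0, -6.0],
        "resi": [4.0, 3.0, 5.0, 2.0],
        "cases_pm": [20.0, 15.0, 30.0, 10.0],
    }
)

# Pool rows across countries, then compute the Pearson correlation matrix.
pooled = pd.concat([us, jp], ignore_index=True)
corr = pooled.corr(method="pearson")

# Rank features by the absolute strength of their correlation with the target.
target_corr = (
    corr["cases_pm"]
    .drop("cases_pm")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Because pooling mixes between-country and within-country variation, a pooled coefficient can differ in size or even sign from the per-country ones, so it is best read as a first pass.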

2.5.3.1 Explore the data: compute the correlation matrix

2.5.3.2 Explore the data: visualize the correlation matrix

2.5.3.3 Explore the data: visualize the pairwise correlation

From the correlation matrix and its visualizations, we've observed interesting relationships among features and between features and the target.

2.5.3.4 Explore the data: time-evolving correlations between vaccination and COVID-19 cases

As speculated above, the correlation between vaccinations and cases changes over time. The correlation started out negative, perhaps reflecting the efficacy of vaccines in preventing infections. However, the relationship reversed in more recent data, which might be due to the surge of infections from the emerging Delta variant and other factors such as the lifting of social distancing measures in many areas. This highlights the complexity of the time-evolving patterns in the feature time series of this dataset. Models that can capture such complex patterns are therefore preferred, which is a vote for ensemble methods based on decision trees (e.g., XGBoost) or deep neural networks (ideally ones with a recurrent architecture, e.g., LSTM).
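A rolling-window correlation is one simple way to see a relationship change sign over time. The sketch below is an assumption about how such an analysis might look, not the notebook's code: the series are synthetic (cases fall for 60 days, then rise, while vaccination climbs steadily) and the 28-day window is an arbitrary choice.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2021-01-01", periods=120, freq="D")

# Synthetic vaccination per million: steadily increasing (made-up values).
vac = pd.Series(np.linspace(0.0, 600000.0, len(days)), index=days)

# Synthetic cases: decline for the first 60 days, then rise again.
cases = pd.Series(
    np.concatenate([np.linspace(300.0, 50.0, 60), np.linspace(50.0, 400.0, 60)]),
    index=days,
)

# Pearson correlation within each trailing 28-day window.
rolling_corr = vac.rolling(window=28).corr(cases)

print(rolling_corr.iloc[59], rolling_corr.iloc[-1])
```

The first 27 values are NaN because the window is not yet full. In this toy setup the correlation sits near -1 during the decline and near +1 during the rise, mirroring the sign flip described above.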

2.6 Summary

From a series of exploratory analyses, we have learned,